Getting started with xdatasets

The xdatasets library gives users access to a large collection of Earth observation datasets in xarray-compatible formats.

The library adopts an opinionated approach to data querying and caters to the specific needs of certain user groups, such as hydrologists, climate scientists, and engineers. One of the functionalities of xdatasets is the ability to extract data at a specific location or within a designated region, such as a watershed or municipality, while also enabling spatial and temporal operations.

Data is requested from xdatasets through a query, written as a Python dictionary. A straightforward query that extracts the variables t2m (2 m temperature) and tp (total precipitation) from the era5_reanalysis_single_levels dataset at two geographical positions (Montreal and Toronto) could look like this:

query = {
    "variables": {"era5_reanalysis_single_levels": ["t2m", "tp"]},
    "space": {
        "clip": "point",  # bbox, point or polygon
        "geometry": {
            "Montreal": (45.508888, -73.561668),  # (latitude, longitude)
            "Toronto": (43.651070, -79.347015),
        },
    },
}
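Judging by the values, the coordinate pairs in this query are in (latitude, longitude) order. ERA5 single-level data is distributed on a 0.25° grid, so each location has to be mapped to a grid cell. As a minimal sketch, assuming nearest-neighbour selection (an assumption; the text does not say how xdatasets resolves points), the hypothetical helper `snap_to_grid` below shows which cell centre each city would fall on:

```python
def snap_to_grid(lat, lon, resolution=0.25):
    """Return the nearest grid-cell centre for a (lat, lon) pair.

    Hypothetical helper for illustration only; not part of xdatasets.
    """
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    if not -180.0 <= lon <= 180.0:
        raise ValueError(f"longitude out of range: {lon}")
    # Round to the closest multiple of the grid resolution.
    return (round(lat / resolution) * resolution,
            round(lon / resolution) * resolution)


# Same geometry as in the point query, in (latitude, longitude) order.
geometry = {
    "Montreal": (45.508888, -73.561668),
    "Toronto": (43.651070, -79.347015),
}

nearest = {name: snap_to_grid(lat, lon) for name, (lat, lon) in geometry.items()}
print(nearest)
```

Under that assumption, Montreal maps to the cell centred on (45.5, -73.5) and Toronto to the one centred on (43.75, -79.25).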

A more complex query might look like the one below. It requests the same variables as above, but instead of point locations it passes a geopandas.GeoDataFrame whose features (for example, read from a shapefile or GeoJSON file) delimit the regions to extract. Each polygon is identified by the unique identifier Station, and a spatial average is computed within each one (aggregation: True). The dataset, natively at an hourly time step, is resampled to a daily time step, applying one or more temporal aggregations to each variable as prescribed in the query. xdatasets then returns the dataset for the specified date range and time zone.

query = {
    "variables": {"era5_reanalysis_single_levels": ["t2m", "tp"]},
    "space": {
        "clip": "polygon", # bbox, point or polygon
        "aggregation": True,
        "geometry": gdf,
        "unique_id": "Station"
    },
    "time": {
        "timestep": "D",
        "aggregation": {"tp": np.nansum,
                        "t2m": [np.nanmax, np.nanmin]},

        "start": '1959-01-01',
        "end": '1961-05-31',
        "timezone": 'America/Montreal',
    },
}
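The "time" block combines three steps: shifting timestamps to a time zone, resampling to a coarser time step, and applying one or more functions per variable. The sketch below reproduces that idea with plain pandas on synthetic hourly data. It is not xdatasets' implementation; the helper `aggregate_time`, the synthetic values, and the assumption that the input timestamps are in UTC are all illustrative.

```python
import numpy as np
import pandas as pd


def aggregate_time(df, aggregation, timestep="D", timezone=None):
    """Resample `df` to `timestep`, one output column per (variable, function).

    Hypothetical helper for illustration only; not part of xdatasets.
    `df` is assumed to be indexed by UTC timestamps.
    """
    df = df.copy()
    if timezone is not None:
        index = df.index if df.index.tz is not None else df.index.tz_localize("UTC")
        df.index = index.tz_convert(timezone)  # bins now start at local midnight

    columns = {}
    for variable, funcs in aggregation.items():
        if not isinstance(funcs, (list, tuple)):
            funcs = [funcs]  # accept a single function or a list of functions
        for func in funcs:
            name = f"{variable}_{func.__name__}"
            columns[name] = df[variable].resample(timestep).apply(
                lambda s, f=func: f(s.to_numpy())
            )
    return pd.DataFrame(columns)


# 48 synthetic hourly values starting at 1979-01-01 00:00 UTC.
hours = pd.date_range("1979-01-01", periods=48, freq=pd.Timedelta(hours=1))
tp = np.ones(48)
tp[0] = np.nan  # one missing value, ignored by np.nansum
hourly = pd.DataFrame({"t2m": np.arange(48, dtype=float), "tp": tp}, index=hours)

daily = aggregate_time(
    hourly,
    {"tp": np.nansum, "t2m": [np.nanmax, np.nanmin]},
    timestep="D",
    timezone="America/Montreal",
)
print(daily)
```

With these inputs the first daily bin falls on 1978-12-31, because midnight UTC on 1 January is 19:00 the previous evening in Montreal. The column names follow a `<variable>_<function>` pattern.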

Note

Don’t worry! Additional examples below will help in comprehending the range of possible queries.

[1]:
## Query climate datasets
[2]:
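import xdatasets as xd
import intake
import numpy as np
import geopandas as gpd
import pandas as pd
from pathlib import Path
import hvplot.pandas  # noqa: F401  registers .hvplot on (Geo)DataFrames
import hvplot.xarray  # noqa: F401  registers .hvplot on xarray objects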
[3]:
# Build the URLs as plain strings: pathlib.Path collapses the "//" after "https:".
bucket = 'https://s3.us-east-2.wasabisys.com/watersheds-polygons/MELCC/json'

paths = [f'{bucket}/023003/023003.json',
         f'{bucket}/031101/031101.json',
         f'{bucket}/040111/040111.json']
[4]:
gdf = pd.concat([gpd.read_file(path).to_crs(4326) for path in paths]).reset_index(drop=True)
gdf
[4]:
  Station         Superficie                                           geometry
0  023003  208.4591919813271  POLYGON ((-70.82601 46.81658, -70.82728 46.815...
1  031101  111.7131058782722  POLYGON ((-73.98519 45.21072, -73.98795 45.209...
2  040111   433.440893903503  POLYGON ((-74.06645 46.02253, -74.06647 46.022...
[5]:
gdf.hvplot(geo=True,
           tiles='ESRI',
           color='Station',
           alpha=0.8,
           width=750,
           height=450,
           legend='top',
           hover_cols=['Station','Superficie'])
[6]:
%%time

# Valid "timestep" values are pandas offset aliases:
# http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases

query = {
    "variables": {"era5_reanalysis_single_levels": ["t2m", "tp"]},
    "space": {
        "clip": "polygon", # bbox, point or polygon
        "aggregation": False,
        "geometry": gdf,
        "unique_id": "Station"
    },
    "time": {
#         "timestep": "D",
#         "aggregation": {"tp": np.nansum,
#                         "t2m": [np.nanmax, np.nanmin]},

        "start": '1979-01-01',
        "end": '2020-12-31',
#         "timezone": 'America/Montreal',
    },
}


xds = xd.Dataset(**query)
3it [00:26,  8.90s/it]
<xarray.Dataset>
Dimensions:    (latitude: 12, longitude: 13, time: 368184, Station: 3)
Coordinates:
  * latitude   (latitude) float32 44.5 44.75 45.0 45.25 ... 46.75 47.0 47.25
  * longitude  (longitude) float32 -75.0 -74.75 -74.5 ... -70.75 -70.5 -70.25
  * time       (time) datetime64[ns] 1979-01-01 ... 2020-12-31T23:00:00
Dimensions without coordinates: Station
Data variables:
    t2m        (Station, time, latitude, longitude) float32 nan nan ... nan nan
    tp         (Station, time, latitude, longitude) float32 nan nan ... nan nan
Attributes:
    institution:  ECMWF
    source:       Reanalysis
    title:        ERA5 forecasts
CPU times: user 13.7 s, sys: 5.41 s, total: 19.1 s
Wall time: 28.1 s
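The size of the time dimension in this output can be cross-checked by hand: the query covers every hour from 1979-01-01 00:00 to 2020-12-31 23:00 inclusive. The short sketch below does the arithmetic with the standard library and also estimates the in-memory size of one float32 variable from the dimensions shown above (3 stations, 12 latitudes, 13 longitudes).

```python
from datetime import datetime, timedelta

start = datetime(1979, 1, 1, 0)
end = datetime(2020, 12, 31, 23)

# Number of hourly time steps, counting both endpoints.
n_hours = int((end - start) / timedelta(hours=1)) + 1
print(n_hours)  # 368184

# Approximate size of one float32 variable (4 bytes per value)
# with dimensions (Station, time, latitude, longitude).
bytes_per_variable = 3 * n_hours * 12 * 13 * 4
print(f"{bytes_per_variable / 1e9:.2f} GB")
```

The result matches the `time: 368184` entry in the output. Without spatial or temporal aggregation each variable holds roughly 0.7 GB of values.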

[7]:
ds_clipped = xds.bbox_clip(xds.data.isel(Station=0))
ds_clipped
[7]:
<xarray.Dataset>
Dimensions:    (time: 368184, latitude: 6, longitude: 6)
Coordinates:
  * latitude   (latitude) float32 46.0 46.25 46.5 46.75 47.0 47.25
  * longitude  (longitude) float32 -71.5 -71.25 -71.0 -70.75 -70.5 -70.25
  * time       (time) datetime64[ns] 1979-01-01 ... 2020-12-31T23:00:00
Data variables:
    t2m        (time, latitude, longitude) float32 272.8 272.1 ... 269.3 269.1
    tp         (time, latitude, longitude) float32 nan nan nan ... 0.0 0.0 0.0
Attributes:
    institution:  ECMWF
    source:       Reanalysis
    title:        ERA5 forecasts
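Comparing the two outputs, the grid shrinks from 12 × 13 cells to 6 × 6 once a single station is selected. That suggests bbox_clip trims the grid to the extent of the cells that hold data for that polygon. A minimal NumPy sketch of the idea, assuming this is the behaviour (the helper `valid_bbox` is hypothetical and not part of xdatasets):

```python
import numpy as np


def valid_bbox(field):
    """Return (row_slice, col_slice) bounding the non-NaN cells of a 2-D array."""
    valid = ~np.isnan(field)
    if not valid.any():
        raise ValueError("field contains no valid cells")
    rows = np.flatnonzero(valid.any(axis=1))  # rows with at least one valid cell
    cols = np.flatnonzero(valid.any(axis=0))  # columns with at least one valid cell
    return (slice(int(rows[0]), int(rows[-1]) + 1),
            slice(int(cols[0]), int(cols[-1]) + 1))


# Synthetic 4 x 5 grid that is NaN everywhere except two cells.
field = np.full((4, 5), np.nan)
field[1, 2] = 272.8
field[2, 3] = 272.1

row_slice, col_slice = valid_bbox(field)
clipped = field[row_slice, col_slice]
print(clipped.shape)
```

Here the 4 × 5 grid is reduced to the 2 × 2 window that encloses the two valid cells. Cells inside the window but outside the polygon remain NaN.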
[8]:
ds_clipped.t2m.hvplot()
[9]:
%%time

# Valid "timestep" values are pandas offset aliases:
# http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases

query = {
    "variables": {"era5_reanalysis_single_levels": ["t2m", "tp"]},
    "space": {
        "clip": "polygon", # bbox, point or polygon
        "aggregation": True,
        "geometry": gdf,
        "unique_id": "Station"
    },
    "time": {
        "timestep": "D",
        "aggregation": {"tp": [np.nansum],
                        "t2m": [np.nanmax, np.nanmin]},

        "start": '1979-01-01',
        "end": '2020-12-31',
        "timezone": 'America/Montreal',
    },
}


xds = xd.Dataset(**query)
3it [00:30, 10.32s/it]
<xarray.Dataset>
Dimensions:     (time: 368184, Station: 3)
Coordinates:
  * time        (time) datetime64[ns] 1978-12-31T19:00:00 ... 2020-12-31T18:0...
    longitude   (Station) float64 -70.94 -74.14 -74.27
    latitude    (Station) float64 46.72 45.18 46.08
    geom        (Station) int64 0 1 2
  * Station     (Station) object '023003' '031101' '040111'
    Superficie  (Station) object '208.4591919813271' ... '433.440893903503'
Data variables:
    t2m         (time, Station) float32 268.8 274.4 269.7 ... 269.3 271.3 266.5
    tp          (time, Station) float32 nan nan nan nan nan ... 0.0 0.0 0.0 0.0
Attributes:
    institution:    ECMWF
    source:         Reanalysis
    title:          ERA5 forecasts
    timezone:       America/Montreal
    regrid_method:  conservative
Processing tp: 100%|██████████| 2/2 [00:07<00:00,  3.59s/it]
CPU times: user 22.5 s, sys: 1.26 s, total: 23.7 s
Wall time: 39 s
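The `regrid_method: conservative` attribute in this output indicates that the spatial average weights each grid cell by how much of it overlaps the polygon. The internals of xdatasets are not shown here, so the following is only a sketch of an area-weighted mean under that assumption. The helper `area_weighted_mean` and the weights are hypothetical.

```python
import numpy as np


def area_weighted_mean(values, weights):
    """Average `values` using overlap `weights`, skipping NaN cells."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    valid = ~np.isnan(values) & (weights > 0)
    if not valid.any():
        return float("nan")
    return float(np.sum(values[valid] * weights[valid]) / np.sum(weights[valid]))


# Synthetic 2 x 2 temperature field (K) and the fraction of the polygon
# area that falls in each cell (made-up numbers for illustration).
t2m = [[270.0, 272.0],
       [np.nan, 274.0]]
overlap = [[0.25, 0.50],
           [0.25, 0.00]]

mean_t2m = area_weighted_mean(t2m, overlap)
print(round(mean_t2m, 3))
```

Cells with zero overlap or missing data do not contribute, and the remaining weights are renormalised so that they sum to one.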

[10]:
xds.data.to_dataframe().hvplot(x='time',
                               y=['t2m_nanmax','t2m_nanmin'],
                               grid=True,
                               width=750,
                               height=450,
                               groupby='Station',
                               legend='top',
                               widget_location='bottom')
[11]:
xds.data.to_dataframe().hvplot(x='time',
                               y=['tp_nansum'],
                               grid=True,
                               width=750,
                               height=450,
                               groupby='Station',
                               legend='top',
                               widget_location='bottom')
[12]:
xds.data
[12]:
<xarray.Dataset>
Dimensions:     (Station: 3, time: 15342)
Coordinates:
    longitude   (Station) float64 -70.94 -74.14 -74.27
    latitude    (Station) float64 46.72 45.18 46.08
    geom        (Station) int64 0 1 2
  * Station     (Station) object '023003' '031101' '040111'
    Superficie  (Station) object '208.4591919813271' ... '433.440893903503'
  * time        (time) datetime64[ns] 1978-12-31 1979-01-01 ... 2020-12-31
Data variables:
    t2m_nanmax  (time, Station) float32 271.1 276.4 271.5 ... 273.7 276.1 272.7
    t2m_nanmin  (time, Station) float32 268.8 274.4 269.7 ... 269.0 271.3 266.3
    tp_nansum   (time, Station) float32 0.0 0.0 0.0 ... 0.002386 0.002362
Attributes:
    institution:  ECMWF
    source:       Reanalysis
    title:        ERA5 forecasts
    timezone:     America/Montreal
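ERA5 provides t2m in kelvin and tp in metres of water, which is why the temperatures above sit around 270 and the precipitation totals are small fractions. A short sketch of converting to more familiar units, using two values taken from the output above (the helper names are illustrative):

```python
KELVIN_OFFSET = 273.15


def kelvin_to_celsius(temperature_k):
    """Convert a temperature from kelvin to degrees Celsius."""
    return temperature_k - KELVIN_OFFSET


def metres_to_millimetres(depth_m):
    """Convert a precipitation depth from metres to millimetres."""
    return depth_m * 1000.0


t2m_nanmax_c = kelvin_to_celsius(271.1)        # first t2m_nanmax value
tp_nansum_mm = metres_to_millimetres(0.002386)  # one of the tp_nansum values

print(f"{t2m_nanmax_c:.2f} degC, {tp_nansum_mm:.3f} mm")
```

Both functions are plain arithmetic, so they work unchanged on NumPy arrays, pandas Series, or xarray DataArray objects such as `xds.data.t2m_nanmax`.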